Foundations of Data Science
These slides are regularly updated; please always refer to the latest version using the provided link.
I will not execute the Python code live, mostly because of time constraints.
However, the coding is an important part of this lecture and I will comment on the most relevant aspects.
The analyses made in these slides are fully reproducible. You are encouraged to:
In the first place, let us load the relevant python packages.
The nltk (Natural Language Toolkit) package is the main tool we will use in this notebook.
You should be familiar with this package, as well as with pandas, at this stage.
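For reference, the import cell might look as follows (a sketch; the exact imports are not shown in these slides, and the `nltk.download` calls are an assumption, needed for the tokenizer and stopword list used later):

```python
import re            # regular expressions
import nltk          # Natural Language Toolkit
import pandas as pd  # data manipulation

# Tokenizer models and stopword lists are downloaded on first use
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
```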
Let us recap some basic notions about strings.
An example of a string is the following:
" symbol instead of ', which is necessary when handling with the apostrophe.+ operator, and repeated using the * operator:This is similar in spirit to the tokenization.
A list containing strings can be joined into a single string, using the following syntax:
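For instance, a minimal sketch of the string operations just mentioned (the example strings are illustrative):

```python
# Double quotes are convenient when the text contains an apostrophe
s = "It's a wonderful film"

# Concatenation with + and repetition with *
greeting = "good " + "movie"  # 'good movie'
emphasis = "very " * 3        # 'very very very '

# A list of strings can be joined into a single string
tokens = ["a", "wonderful", "little", "production"]
sentence = " ".join(tokens)   # 'a wonderful little production'
```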
Regular expressions, often shortened to “regex”, are a powerful and flexible method for specifying search patterns.
An example of a regular expression is ([A-z])+.
To use regular expressions in Python, we need the re package.
There are several online resources about regex.
If you want to play around with this kind of syntax, you can visit the website https://regexr.com.
We will not discuss regular expressions in this notebook, but it is essential to keep in mind that they are often at the core of more advanced functions.
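As a small illustration (a sketch; note that the character class [A-Za-z] is used here, since [A-z] also matches a few punctuation characters lying between Z and a in the ASCII table):

```python
import re

# Find all runs of letters in a sample string
text = "I've seen 950+ films!"
words = re.findall(r"[A-Za-z]+", text)
# → ['I', 've', 'seen', 'films']
```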
In this notebook, we will analyze a subset of the Large Movie Review Dataset.
This dataset is associated with the paper
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).
The IMDB_small.csv dataset contains only 200 movie reviews. The original dataset is much bigger.
Let us import it into python, using pandas:
Five reviews (out of 200) are displayed below:

|  | review |
|---|---|
| 0 | One of the other reviewers has mentioned that ... |
| 1 | A wonderful little production. <br /><br />The... |
| 2 | I thought this was a wonderful way to spend ti... |
| 3 | Basically there's a family where a little boy ... |
| 4 | Petter Mattei's "Love in the Time of Money" is... |
| Document | Word 1 | Word 2 | Word 3 | … | Word \(p-1\) | Word \(p\) |
|---|---|---|---|---|---|---|
| Review 1 | \(n_{11}\) | \(n_{12}\) | \(n_{13}\) | \(\dots\) | \(n_{1,{p-1}}\) | \(n_{1p}\) |
| Review 2 | \(n_{21}\) | \(n_{22}\) | \(n_{23}\) | \(\dots\) | \(n_{2,p-1}\) | \(n_{2p}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) | \(\vdots\) |
| Review \(N\) | \(n_{N1}\) | \(n_{N2}\) | \(n_{N3}\) | \(\dots\) | \(n_{N,p-1}\) | \(n_{Np}\) |
Each \(n_{ij}\) is the number of times the \(j\)th word appears in the \(i\)th review.
This object is sometimes called a document-term matrix, and it is the starting point of most analyses.
This is a deceptively simple problem: in practice, it requires a lot of pre-processing.
A bag of words. What is the implicit assumption behind this representation?
"Encouraged by the positive comments about this film on here I was looking forward to watching this film. Bad mistake. I've seen 950+ films and this is truly one of the worst of them - it's awful in almost every way: editing, pacing, storyline, 'acting,' soundtrack (the film's only song - a lame country tune - is played no less than four times). The film looks cheap and nasty and is boring in the extreme. Rarely have I been so happy to see the end credits of a film. <br /><br />The only thing that prevents me giving this a 1-score is Harvey Keitel - while this is far from his best performance he at least seems to be making a bit of an effort. One for Keitel obsessives only."
It is not the most positive review ever written :-)
Let us focus on the technical aspects at the moment.
There are several weird <br /> symbols, which are HTML tags.
In fact, these movie reviews have been downloaded from the IMDB website.
These tags are not informative, so we need to remove them. A first approach is using regular expressions.
The following command replaces <br /> with a blank space.
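A minimal version of such a command could look as follows (a sketch; the sample string and the exact pattern are illustrative):

```python
import re

sample = "A wonderful little production.<br /><br />The filming technique gives a sense of realism."
# Replace the literal <br /> tag (with optional spacing) by a blank space
cleaned = re.sub(r"<br\s*/?>", " ", sample)
```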
"Encouraged by the positive comments about this film on here I was looking forward to watching this film. Bad mistake. I've seen 950+ films and this is truly one of the worst of them - it's awful in almost every way: editing, pacing, storyline, 'acting,' soundtrack (the film's only song - a lame country tune - is played no less than four times). The film looks cheap and nasty and is boring in the extreme. Rarely have I been so happy to see the end credits of a film. The only thing that prevents me giving this a 1-score is Harvey Keitel - while this is far from his best performance he at least seems to be making a bit of an effort. One for Keitel obsessives only."
Albeit useful, the above regular expression fixes only one very specific HTML tag.
To remove all the HTML parts of the text, we need a proper HTML parser.
Here, we make use of the BeautifulSoup package, whose documentation is available online.
from bs4 import BeautifulSoup # Load the package
# Removes the <br /> and other HTML tags
def remove_html(data):
    # Specifying the parser explicitly avoids a BeautifulSoup warning
    data = BeautifulSoup(data, "html.parser")
    return data.get_text()

review = remove_html(review)
review"Encouraged by the positive comments about this film on here I was looking forward to watching this film. Bad mistake. I've seen 950+ films and this is truly one of the worst of them - it's awful in almost every way: editing, pacing, storyline, 'acting,' soundtrack (the film's only song - a lame country tune - is played no less than four times). The film looks cheap and nasty and is boring in the extreme. Rarely have I been so happy to see the end credits of a film. The only thing that prevents me giving this a 1-score is Harvey Keitel - while this is far from his best performance he at least seems to be making a bit of an effort. One for Keitel obsessives only."
Another source of concern is the presence of standard English abbreviations, which we want to replace with their extended form.
We can do this by defining our own dictionary.
The following dictionary is by no means exhaustive. Feel free to modify it and add other examples.
def remove_abb(review):
    replacements = {
        "ain't": "am not",
        "aren't": "are not",
        "can't": "cannot",
        "could've": "could have",
        "couldn't": "could not",
        "didn't": "did not",
        "doesn't": "does not",
        "don't": "do not",
        "gonna": "going to",
        "hadn't": "had not",
        "hasn't": "has not",
        "haven't": "have not",
        "he'd": "he would",
        "he'll": "he will",
        "he's": "he is",
        "how'd": "how did",
        "how'll": "how will",
        "how's": "how is",
        "I'd": "I would",
        "I'll": "I will",
        "I'm": "I am",
        "I've": "I have",
        "isn't": "is not",
        "it'd": "it would",
        "it'll": "it will",
        "it's": "it is",
        "It's": "It is",
        "let's": "let us",
        "mightn't": "might not",
        "mustn't": "must not",
        "shan't": "shall not",
        "she'd": "she would",
        "she'll": "she will",
        "she's": "she is",
        "should've": "should have",
        "shouldn't": "should not",
        "that's": "that is",
        "there's": "there is",
        "they'd": "they would",
        "wanna": "want to",
        "We're": "We are"
    }
    for key, value in replacements.items():
        # re.escape protects the apostrophe; the \b anchors avoid
        # matching inside longer words (e.g. "he's" inside "she's")
        review = re.sub(r"\b{}\b".format(re.escape(key)), value, review)
    return review

review = remove_abb(review)
review

"Encouraged by the positive comments about this film on here I was looking forward to watching this film. Bad mistake. I have seen 950+ films and this is truly one of the worst of them - it is awful in almost every way: editing, pacing, storyline, 'acting,' soundtrack (the film's only song - a lame country tune - is played no less than four times). The film looks cheap and nasty and is boring in the extreme. Rarely have I been so happy to see the end credits of a film. The only thing that prevents me giving this a 1-score is Harvey Keitel - while this is far from his best performance he at least seems to be making a bit of an effort. One for Keitel obsessives only."
We now convert the text to lowercase.
This can be done using the .lower() method:
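For instance, a sketch on an illustrative string (the slides apply the same method to the full review):

```python
sample = "Bad Mistake. I HAVE seen 950+ films"
lowered = sample.lower()
# → 'bad mistake. i have seen 950+ films'
```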
"encouraged by the positive comments about this film on here i was looking forward to watching this film. bad mistake. i have seen 950+ films and this is truly one of the worst of them - it is awful in almost every way: editing, pacing, storyline, 'acting,' soundtrack (the film's only song - a lame country tune - is played no less than four times). the film looks cheap and nasty and is boring in the extreme. rarely have i been so happy to see the end credits of a film. the only thing that prevents me giving this a 1-score is harvey keitel - while this is far from his best performance he at least seems to be making a bit of an effort. one for keitel obsessives only."
Tokenization is the task of cutting a string into linguistic units that constitute a piece of language data.
Tokenization is performed using specialized functions, such as word_tokenize from the nltk python package:
review_tokens = nltk.word_tokenize(review) # Perform tokenization
review_tokens[140:] # Show the last tokens

['one', 'for', 'keitel', 'obsessives', 'only', '.']
In our analyses, we wish to focus on words; therefore, we delete commas, dots, and other special symbols such as !@#*, which word_tokenize keeps as separate tokens.
This is a simplifying operation, because punctuation might be very informative.
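A common way to drop such tokens is str.isalpha(), as in this sketch (the token list is illustrative):

```python
tokens = ["encouraged", "by", "the", "positive", ",", "950+", "!", "film", "."]
# str.isalpha() keeps only purely alphabetic tokens
word_tokens = [t for t in tokens if t.isalpha()]
# → ['encouraged', 'by', 'the', 'positive', 'film']
```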
In many languages, there are high-frequency words that have no meaning on their own, such as conjunctions and articles.
These tokens are called stopwords and we wish to eliminate them.
A list of stopwords is conveniently stored in the nltk.corpus package, as shown below
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
Stemming reduces each word to its root by deleting suffixes, thus shrinking the dictionary and avoiding duplicated tokens.
Stemming is performed using the SnowballStemmer function.
Other stemmers are available in the nltk package; please see the documentation for further info.
Stemmers are language-dependent: we need to specify that the reviews are written in English.
stemmer = nltk.SnowballStemmer("english")
review_tokens = [stemmer.stem(word) for word in review_tokens]
review_tokens[:8]

['encourag', 'posit', 'comment', 'film', 'look', 'forward', 'watch', 'film']
from nltk.tokenize.treebank import TreebankWordDetokenizer
detokenizer = TreebankWordDetokenizer()
# Create a "review" from the stemmed tokens
detokenizer.detokenize(review_tokens)

'encourag posit comment film look forward watch film bad mistak seen film truli one worst aw almost everi way edit pace storylin soundtrack film song lame countri tune play less four time film look cheap nasti bore extrem rare happi see end credit film thing prevent give harvey keitel far best perform least seem make bit effort one keitel obsess'
# 1st round of pre-processing
def basic_cleaning(review):
    review = remove_html(review)  # Remove HTML
    review = remove_abb(review)   # Expand abbreviations
    return review

# 2nd round of pre-processing
def advanced_cleaning(review):
    # Basic cleaning (HTML + abbreviations)
    review = basic_cleaning(review)
    # Normalization
    review = review.lower()
    # Tokenization
    review_tokens = nltk.word_tokenize(review)
    # Special symbols and punctuation
    review_tokens = [word for word in review_tokens if word.isalpha()]
    # Filtering out stopwords
    review_tokens = [word for word in review_tokens if word not in stopwords.words('english')]
    # Stemming
    stemmer = nltk.SnowballStemmer("english")
    review_tokens = [stemmer.stem(word) for word in review_tokens]
    # Conversion to a single string
    review = detokenizer.detokenize(review_tokens)
    return review

'Petter Mattei\'s "Love in the Time of Money" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situations we encounter. <br /><br />This being a variation on the Arthur Schnitzler\'s play about the same theme, the director transfers the action to the present time New York where all these different characters meet and connect. Each one is connected in one way, or another to the next person, but no one seems to know the previous point of contact. Stylishly, the film has a sophisticated luxurious look. We are taken to see how these people live and the world they live in their own habitat.<br /><br />The only thing one gets out of all these souls in the picture is the different stages of loneliness each one inhabits. A big city is not exactly the best place in which human relations find sincere fulfillment, as one discerns is the case with most of the people we encounter.<br /><br />The acting is good under Mr. Mattei\'s direction. Steve Buscemi, Rosario Dawson, Carol Kane, Michael Imperioli, Adrian Grenier, and the rest of the talented cast, make these characters come alive.<br /><br />We wish Mr. Mattei good luck and await anxiously for his next work.'
'Petter Mattei\'s "Love in the Time of Money" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situations we encounter. This being a variation on the Arthur Schnitzler\'s play about the same theme, the director transfers the action to the present time New York where all these different characters meet and connect. Each one is connected in one way, or another to the next person, but no one seems to know the previous point of contact. Stylishly, the film has a sophisticated luxurious look. We are taken to see how these people live and the world they live in their own habitat.The only thing one gets out of all these souls in the picture is the different stages of loneliness each one inhabits. A big city is not exactly the best place in which human relations find sincere fulfillment, as one discerns is the case with most of the people we encounter.The acting is good under Mr. Mattei\'s direction. Steve Buscemi, Rosario Dawson, Carol Kane, Michael Imperioli, Adrian Grenier, and the rest of the talented cast, make these characters come alive.We wish Mr. Mattei good luck and await anxiously for his next work.'
'petter mattei love time money visual stun film watch mattei offer us vivid portrait human relat movi seem tell us money power success peopl differ situat encount variat arthur schnitzler play theme director transfer action present time new york differ charact meet connect one connect one way anoth next person one seem know previous point contact stylish film sophist luxuri look taken see peopl live world live thing one get soul pictur differ stage loneli one inhabit big citi exact best place human relat find sincer fulfil one discern case peopl act good mattei direct steve buscemi rosario dawson carol kane michael imperioli adrian grenier rest talent cast make charact come wish mattei good luck await anxious next work'
Apply the basic_cleaning and advanced_cleaning to all the reviews.
Create two new variables in the dataset: review_clean and review_token.
# This could take a while
imdb['review_clean'] = imdb['review'].apply(lambda z: basic_cleaning(z))
imdb['review_token'] = imdb['review'].apply(lambda z: advanced_cleaning(z))
imdb.head(2)

|  | review | review_clean | review_token |
|---|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | One of the other reviewers has mentioned that ... | one review mention watch oz episod hook right ... |
| 1 | A wonderful little production. <br /><br />The... | A wonderful little production. The filming tec... | wonder littl product film techniqu fashion giv... |
# Put everything into a single string
words = ' '.join(imdb['review_token'])
# Create a global tokenization
tokens = nltk.word_tokenize(words)
# Conversion to "text"
text = nltk.Text(tokens)
# Compute the most common words
fdist = nltk.FreqDist(text)
# Use pandas for organizing and displaying the results
df_words = pd.DataFrame(list(fdist.items()), columns = ["Word","Frequency"])
# Order words from the most frequent
df_words = df_words.sort_values(by = "Frequency", ascending = False)
# Dimension of the dataset
df_words.shape

(5294, 2)

We obtain 5294 different stems.

| Document | Word 1 | Word 2 | Word 3 | … | Word \(p-1\) | Word \(p\) |
|---|---|---|---|---|---|---|
| Review 1 | \(n_{11}\) | \(n_{12}\) | \(n_{13}\) | \(\dots\) | \(n_{1,{p-1}}\) | \(n_{1p}\) |
| Review 2 | \(n_{21}\) | \(n_{22}\) | \(n_{23}\) | \(\dots\) | \(n_{2,p-1}\) | \(n_{2p}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) | \(\vdots\) |
| Review \(N\) | \(n_{N1}\) | \(n_{N2}\) | \(n_{N3}\) | \(\dots\) | \(n_{N,p-1}\) | \(n_{Np}\) |
This is now an easy task, because documents (reviews) have been tokenized and stemmed.
In practice, we will make use of the CountVectorizer of the sklearn python package.
The total number of distinct stems we obtained after cleaning is 5294.
We consider only a fraction of them (\(p = 500\)): those with the highest frequencies.
from sklearn.feature_extraction.text import CountVectorizer
# Creation of a TDM with p = 500 words
vectorizer = CountVectorizer(max_features = 500)
X = vectorizer.fit_transform(imdb['review_token'])
word_names = list(vectorizer.get_feature_names_out())
# Conversion to dataframe
X = pd.DataFrame(X.toarray())
# Renaming columns according to words
X.columns = word_names

The CountVectorizer function performs more operations than we need.
For example, it silently converts the text to lowercase. It is also possible to remove stopwords.
These operations are redundant in our case.
Please refer to the official documentation for further details.
|  | abl | absolut | act | action | actor | actress | actual | age | almost | along | ... | worst | worth | would | write | written | year | yes | yet | young | zombi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
8 rows × 500 columns
Sometimes, one might be interested in a variation of the former TDM, based on the so-called term frequency-inverse document frequency (TF-IDF).
Each \(n_{ij}\) is the number of times the \(j\)th word appears in the \(i\)th review, for \(i = 1,\dots,N\) and \(j = 1,\dots,p\).
Let us define the following quantities:
\[ N_j = \sum_{i=1}^N I(n_{ij} > 0) = \text{("Number of documents containing the j-th word")}. \]
\[ n_{i \cdot} = \sum_{j=1}^p n_{ij} = \text{("Number of words in the i-th document")}. \]
\[ f_{ij} = \frac{n_{ij}}{n_{i \cdot}}. \]
\[ \text{IDF}_{j} = \log\left({\frac{N}{N_j}}\right). \]
\[ \text{TF-IDF}_{ij} = f_{ij} \times \text{IDF}_j. \]
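Under these definitions, the TF-IDF matrix can be computed by hand, as in the following numpy sketch (the toy counts are illustrative; note that sklearn's TfidfVectorizer uses a smoothed IDF and row normalization, so its values will differ slightly):

```python
import numpy as np

# Toy document-term matrix: N = 3 documents, p = 4 words (illustrative counts)
n = np.array([[2, 0, 1, 1],
              [0, 1, 1, 0],
              [3, 0, 0, 1]], dtype=float)

N = n.shape[0]
N_j = (n > 0).sum(axis=0)             # number of documents containing word j
f = n / n.sum(axis=1, keepdims=True)  # term frequencies f_ij
idf = np.log(N / N_j)                 # inverse document frequencies IDF_j
tfidf = f * idf                       # TF-IDF_ij
```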
from sklearn.feature_extraction.text import TfidfVectorizer
# Creation of a TDM TF-IDF with p = 500 words
vectorizer = TfidfVectorizer(max_features = 500)
X = vectorizer.fit_transform(imdb['review_token'])
word_names = list(vectorizer.get_feature_names_out())
# Conversion to dataframe
X = pd.DataFrame(X.toarray())
# Renaming columns according to words
X.columns = word_names

|  | abl | absolut | act | action | actor | actress | actual | age | almost | along | ... | worst | worth | would | write | written | year | yes | yet | young | zombi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.0 | ... | 0.0 | 0.000000 | 0.109653 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 1 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.108836 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.0 | ... | 0.0 | 0.144457 | 0.000000 | 0.000000 | 0.166546 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 2 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.0 | ... | 0.0 | 0.000000 | 0.095167 | 0.000000 | 0.000000 | 0.133460 | 0.0 | 0.0 | 0.151784 | 0.000000 |
| 3 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.0 | ... | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.246393 |
| 4 | 0.0 | 0.0 | 0.077615 | 0.106456 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.0 | ... | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 5 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.0 | ... | 0.0 | 0.000000 | 0.115505 | 0.000000 | 0.000000 | 0.161981 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 6 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.0 | ... | 0.0 | 0.000000 | 0.403626 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 7 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.12045 | 0.0 | ... | 0.0 | 0.000000 | 0.071358 | 0.128576 | 0.000000 | 0.100071 | 0.0 | 0.0 | 0.000000 | 0.000000 |
8 rows × 500 columns
Sentiment analysis is the practice of understanding the overall opinion (sentiment) of a document.
It is arguably a very difficult (sometimes impossible) task, especially in the presence of complex texts.
Here, we showcase a straightforward algorithm for sentiment analysis, based on the idea of scoring, called VADER.
The associated article is:
Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
The core idea is straightforward: “positive words” are given a positive score, and vice versa with “negative words”.
A human identifies whether these are positive or negative terms.
Then, the scores are weighted, manipulated, and summarized through a large number of heuristics.
Even though VADER is very simplistic, it is quick to compute and a reasonable starting point for more complex analyses. Let us see it in action:
'A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well done.'
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# The VADER lexicon must be downloaded once, via nltk.download('vader_lexicon')
sentiment = SentimentIntensityAnalyzer()
sentiment.polarity_scores(review)

{'neg': 0.055, 'neu': 0.768, 'pos': 0.177, 'compound': 0.9641}
The compound term is a standardized score in \((-1, 1)\), measuring whether the sentiment is negative or positive.
In this case, the VADER algorithm correctly identifies the sentiment.
"This show was an amazing, fresh & innovative idea in the 70's when it first aired. The first 7 or 8 years were brilliant, but things dropped off after that. By 1990, the show was not really funny anymore, and it is continued its decline further to the complete waste of time it is today.It's truly disgraceful how far this show has fallen. The writing is painfully bad, the performances are almost as bad - if not for the mildly entertaining respite of the guest-hosts, this show probably wouldn't still be on the air. I find it so hard to believe that the same creator that hand-selected the original cast also chose the band of hacks that followed. How can one recognize such brilliance and then see fit to replace it with such mediocrity? I felt I must give 2 stars out of respect for the original cast that made this show such a huge success. As it is now, the show is just awful. I cannot believe it is still on the air."
Here, the predicted sentiment is entirely wrong. Perhaps the words amazing, fresh, and innovative (and others) misled the algorithm.